This is a study note combining course material from IE 583 by Siggi Olafsson at ISU with some additional material. The following topics will be discussed: classification, clustering, association rule mining, and advanced classification. For more details on the study material see:
| Machine Learning | Description |
|---|---|
| Supervised learning (SML) | the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative. |
| Unsupervised learning (UML) | no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data. |
| Semi-supervised learning | approaches use labelled data to inform unsupervised learning on the unlabelled data to identify and annotate new classes in the dataset (also called novelty detection). |
| Reinforcement learning | the learning algorithm performs a task using feedback from operating in a real or synthetic environment. |
To tune a predictive model using multiple workers, the function syntax of the \(caret\) package functions (e.g. train, rfe or sbf) does not change. A separate function is used to “register” the parallel processing technique and specify the number of workers to use.
# Parallel Processing
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
# stopCluster(cl) # release the workers when finished
Run the following chunk to install all the libraries used in this note.
package_required <- c("doParallel", "caret","tidyverse","klaR","mlbench","e1071","ranger")
sapply(package_required, require, character.only=TRUE)
# Set random seed
set.seed(1234)
In this chapter, we will use functions and tools from the \(caret\) R package. The \(caret\) package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation.
# data
dataset <- read.csv("credit_g.csv")
dataset$class <- as.factor(dataset$class)
head(as_tibble(dataset))
## # A tibble: 6 x 21
## checking_status duration credit_history purpose credit_amount
## <fct> <int> <fct> <fct> <int>
## 1 <0 6 'critical/oth~ radio/~ 1169
## 2 0<=X<200 48 'existing pai~ radio/~ 5951
## 3 'no checking' 12 'critical/oth~ educat~ 2096
## 4 <0 42 'existing pai~ furnit~ 7882
## 5 <0 24 'delayed prev~ 'new c~ 4870
## 6 'no checking' 36 'existing pai~ educat~ 9055
## # ... with 16 more variables: savings_status <fct>, employment <fct>,
## # installment_commitment <int>, personal_status <fct>,
## # other_parties <fct>, residence_since <int>, property_magnitude <fct>,
## # age <int>, other_payment_plans <fct>, housing <fct>,
## # existing_credits <int>, job <fct>, num_dependents <int>,
## # own_telephone <fct>, foreign_worker <fct>, class <fct>
For more detail about this section, check the Link
# Classification and Regression training
library(caret)
library(tidyverse)
Note: remember to also load the specific package for the learning method so that the train() function works.
| R Package | Description |
|---|---|
| caret | classification and regression training: model tuning and evaluation |
| ggplot2 | grammar-of-graphics plotting |
| mlbench | benchmark machine learning datasets (e.g. Sonar) |
| class | k-nearest neighbour classification |
| caTools | utilities, including ROC/AUC via colAUC |
| randomForest | random forests |
| impute | missing-value imputation |
| ranger | a fast implementation of random forests |
| kernlab | kernel-based methods, including SVMs |
| glmnet | regularized (lasso/ridge) generalized linear models |
| naivebayes | naive Bayes classifiers |
| rpart | recursive partitioning (CART) trees |
| rpart.plot | plotting rpart trees |
createDataPartition

The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.

- times: create multiple splits at once; the data indices are returned in a list of integer vectors.
- createResample: make simple bootstrap samples.
- createFolds: generate balanced cross-validation groupings from a set of data.

trainIndex <- createDataPartition(dataset$class, p = 0.67, list = FALSE, times = 1)
Train_set <- dataset[trainIndex, ]
Test_set <- dataset[-trainIndex, ]
trainControl

The function trainControl can be used to specify the type of resampling. It also generates parameters that further control how models are created:

- allowParallel = TRUE: a logical that governs whether train should use parallel processing (when available)
- search = "random": use a random search
- method: the resampling method (see the table: List of resampling method argument)

| Resampling method | Description |
|---|---|
| boot | the usual bootstrap |
| cv | cross-validation |
| LOOCV | leave-one-out cross-validation |
| LGOCV | leave-group-out cross-validation (for repeated Train_set/test splits) |
| repeatedcv | repeated cross-validation |
| timeslice | cross-validation for time series data |
| oob | out-of-bag estimates (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models) |
| optimism_boot | the optimism bootstrap estimator |
| boot632 | the 0.632 bootstrap estimator |
| boot_all | all of boot, boot632, optimism_boot (for efficiency, but boot will be used for calculations) |
| adaptive_cv, adaptive_boot or adaptive_LGOCV | adaptive resampling |
| none | only fits one model to the entire Train_set set |
ctrl <- trainControl(method = "boot632", savePred = T, classProb = T, number = 10)
expand.grid

\(caret\) automates the tuning of hyperparameters using a grid search, via either:

- tuneLength: sets the number of hyperparameter values to test.
- tuneGrid: directly defines the hyperparameter values, which requires knowledge of the model. Find the hyperparameters from train Models By Tag.

train

The train function can be used to:

- preProcess: allows single and multiple pre-processing methods:
  - preProcess = "medianImpute": imputation using the median of features. This method works well if the data are missing at random.
  - preProcess = "knnImpute": kNN imputation will impute missing values based on other, similar non-missing rows. The default value is 5.
  - preProcess = "scale": division by the standard deviation
  - preProcess = "center": subtraction of the mean
  - preProcess = "pca": PCA can be used as a pre-processing method, generating a set of high-variance and perpendicular predictors, preventing collinearity.
- trControl: estimate model performance from a training set
- tuneGrid: evaluate, using resampling, the effect of model tuning parameters on performance
- method: choose the “optimal” model across these parameters. Check the train Model List.

| Method | Data used | Other Assumption |
|---|---|---|
| Naive Bayes | All | Attribute Independence |
| Decision tree | Only a few attributes but all the instances | - |
| K-NN | All the attributes but only a few instances | - |
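As a hypothetical illustration of combining the train() options described above (method, preProcess, tuneLength and trControl are real caret arguments; the resampling choices here are arbitrary):

```r
# Sketch: centre and scale the predictors before fitting a kNN model.
# Assumes `dataset` has been loaded as in the data chunk above.
ctrl_cv <- trainControl(method = "cv", number = 5)
KNNscaled <- train(class ~ ., data = dataset,
                   method = "knn",
                   preProcess = c("center", "scale"),  # standardise predictors
                   tuneLength = 5,                     # try 5 values of k
                   trControl = ctrl_cv)
```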
1. Read the Train_set dataset.
2. Calculate the mean and standard deviation of the predictor variables in each class.
3. Repeat: calculate the probability for each class using the Gaussian density equation, until the probability of every predictor variable has been calculated.
4. Calculate the likelihood for each class.
5. Get the greatest likelihood.
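These steps can be sketched for a single numeric predictor; this is illustrative only (iris stands in for the Train_set data, and caret/klaR perform these computations internally):

```r
# Gaussian density for one predictor (the "guess density" step)
gaussian_density <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

# Per-class mean and standard deviation of Sepal.Length (step 2)
stats <- aggregate(Sepal.Length ~ Species, data = iris,
                   FUN = function(x) c(mu = mean(x), sigma = sd(x)))

# Likelihood of observing Sepal.Length = 5.0 under each class (steps 3-5)
lik <- apply(stats$Sepal.Length, 1,
             function(s) gaussian_density(5.0, s["mu"], s["sigma"]))
names(lik) <- stats$Species
names(which.max(lik))  # class with the greatest likelihood
```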
For more info, click the link.
library(klaR)
# Tuning parameters
searchGrid <-expand.grid(fL = 1:4, usekernel = c(FALSE), adjust = seq(0, 10, by = 1))
# Train model
NBmodel <-train(class~., data = dataset, method = "nb", trControl = ctrl, tuneGrid = searchGrid)
# plot search grid results
plot(NBmodel)
A great advantage of decision trees is that they make a complex decision simpler by breaking it down into smaller, simpler decisions using a divide-and-conquer strategy. They basically identify a set of if-else conditions that split the data according to the values of the features. Decision trees choose splits that give the most homogeneous partitions, leading to smaller and more homogeneous partitions over their iterations.
An issue with single decision trees is that they can grow large and complex with many branches, which corresponds to over-fitting. Over-fitting models noise rather than general patterns in the data, focusing on subtle patterns (outliers) that won't generalise.
1. Check for the above base cases.
2. For each attribute a, find the normalized information gain ratio from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of node.
6. End
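A minimal sketch of step 2, using plain information gain on a hypothetical attribute (C4.5 additionally normalises the gain by the split information to obtain the gain ratio; this is not the J48 internals):

```r
# Entropy of a label vector, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain from splitting `labels` on `attribute`
info_gain <- function(attribute, labels) {
  w <- table(attribute) / length(attribute)          # branch weights
  cond <- sapply(split(labels, attribute), entropy)  # per-branch entropy
  entropy(labels) - sum(w * cond)
}

# Hypothetical weather-style attribute
outlook <- c("sunny", "sunny", "rain", "rain", "overcast")
play    <- c("no", "no", "yes", "yes", "yes")
info_gain(outlook, play)  # 0.971: the split separates the classes perfectly
```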
The plot function can be used to examine the relationship between the estimates of performance and the tuning parameters
# Tuning parameters
searchGrid <-expand.grid(C=(6:10)*0.05 ,M=(1:3)*1)
# Train model
DTmodel <-train(class~., data = dataset ,method="J48", trControl=ctrl, tuneGrid = searchGrid)
# print(DTmodel)
# summary(DTmodel)
# plot search grid results
plot(DTmodel, metric = "Accuracy")
plot(DTmodel, metric = "Kappa")
# plot DT
# plot(DTmodel$finalModel)
K nearest neighbours works by directly measuring the (Euclidean) distance between observations and inferring the class of unlabelled data from the class of its nearest neighbours.
1. Load the training and test data
2. Choose the value of K
3. For each point in test data:
4. find the Euclidean distance to all training data points
5. store the Euclidean distances in a list and sort it
6. choose the first k points
7. assign a class to the test point based on the majority of classes present in the chosen points
8. End
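A base-R sketch of these steps, assuming numeric predictors (the caret call with method = "knn" does this, plus tuning, for you):

```r
# Classify one test point by majority vote among its k nearest neighbours
knn_predict <- function(train_x, train_y, test_point, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, test_point)^2))  # Euclidean distances
  neighbours <- train_y[order(d)[1:k]]                 # k closest labels
  names(which.max(table(neighbours)))                  # majority class
}

# Example: classify one held-out iris flower
x <- as.matrix(iris[, 1:4])
knn_predict(x[-1, ], iris$Species[-1], x[1, ], k = 5)  # "setosa"
```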
This approach may overfit the data in high dimensions.
# Tuning parameters
searchGrid <-expand.grid(k = c(1:10))
# Train model
KNNmodel <- train(class~., data=dataset, method="knn", trControl = ctrl, tuneGrid = searchGrid)
# plot search grid results
plot(KNNmodel)
This section only applies to classification.

In supervised machine learning, we have a desired output and thus know precisely what is to be computed. It thus becomes possible to directly evaluate a model using a quantifiable and objective metric. The training process seeks to optimise:

- root mean squared error (RMSE) for regression.
- prediction accuracy for classification.

The error can be measured as:

- in-sample error: leads to an optimistic assessment of our model. Indeed, the model has already seen these data during construction, and is thus considered optimised for these observations in particular; it is said to over-fit the data.
- out-of-sample error (preferred): measured on new data, to gain a better idea of how the model performs on unseen data, and to estimate how well the model generalises.

| Evaluation Method | Concept | Outcome | Data size |
|---|---|---|---|
| Independent test dataset | single split into training/testing set | estimate of the out-of-sample error | Large data size |
| cross validation | Multiple splits into training/testing sets | better estimate of the out-of-sample error | Moderate data size |
| Bootstrap | random sampling from dataset with replacement | gives a sense of the distribution | Small data size |
| Ensemble Learning | Concept | Outcome | Data size |
|---|---|---|---|
| Bagging or Bootstrap aggregation | average multiple models through Bootstrap sampling | minimizes RMSE loss | - |
| Boosting | consecutively train a single model | solve for net error from the prior model to improve accuracy with some small risk of less coverage | - |
Randomly select a subset of the data to be the Train_set data and test data:
1. Create a random (e.g. 67/33) split to define the test and train subsets.
2. Train a regression or classification model on the Train_set data.
3. Test the model on the Test_set data.
4. Calculate the out-of-sample RMSE or prediction accuracy.
set.seed(1234)
trainIndex <- createDataPartition(dataset$class ,p=.67,list=FALSE,times=1)
Train_set <- dataset[trainIndex,]
Test_set <- dataset[-trainIndex,]
Instead of doing a single training/testing split, we can systematise this process and produce multiple, different out-of-sample train/test splits, which will lead to a better estimate of the out-of-sample error.
Schematic of 3-fold cross validation producing three training (blue) and testing (white) splits.
set.seed(42)
ctrl = trainControl(method = "cv", number = 10, verboseIter = FALSE)
LMmodel_10cv <- train(price ~ ., diamonds, method = "lm", trControl= ctrl)
p <- predict(LMmodel_10cv, diamonds)
error <- p - diamonds$price
rmse_xval <- sqrt(mean(error^2)) ## xval RMSE
rmse_xval
## [1] 1129.843
\[e_{cv} = \frac{\sum_{i=1}^{k_{fold}} e_i}{k_{fold}}\]
The idea is to draw random samples with replacement of size N from the training data with size M (M > N). This process is repeated B times to get B bootstrap datasets.
bootstrap
\[E_{boot} = weight_{train} \times e_{train} + weight_{test} \times e_{test}\] \[e_{0.632} = 0.368 \times e_{train} + 0.632 \times e_{test}\]
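A quick worked example of the 0.632 estimator above, with hypothetical training and test (out-of-bag) errors:

```r
e_train <- 0.10   # resubstitution (training) error, hypothetical
e_test  <- 0.25   # out-of-bag (test) error, hypothetical
e_632   <- 0.368 * e_train + 0.632 * e_test
e_632   # 0.1948: weighted towards the more pessimistic test error
```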
Schematic of the RF algorithm based on the Bagging (Bootstrap + Aggregating) method by Xiaogang HE (original source: https://www.researchgate.net/publication/309031320_Spatial_downscaling_of_precipitation_using_adaptable_random_forests)
Building a random forest starts by generating a high number of individual decision trees. A single decision tree isn't very accurate, but many different trees built using different inputs (with bootstrapped observations and features) make it possible to explore a broad search space and, once combined, produce accurate models; this technique is called bootstrap aggregation or bagging.
library(mlbench)
data(Sonar)
library(e1071)
library(ranger)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
myGrid <- expand.grid(mtry = c(5, 10, 20, 40, 60),
min.node.size = c(5, 10, 15),
splitrule = c("gini", "extratrees"))
RFmodel <- train(Class ~ .,
data = Sonar,
method = "ranger",
tuneGrid = myGrid,
trControl = ctrl)
# RFmodel
plot(RFmodel)
| Reference Yes | Reference No | |
|---|---|---|
| Predicted Yes | TP | FP |
| Predicted No | FN | TN |
# use DTmodel as an example
prediction <- predict(DTmodel, Test_set)
confusionMatrix(prediction,Test_set$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 89 3
## good 10 228
##
## Accuracy : 0.9606
## 95% CI : (0.9336, 0.9789)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9043
## Mcnemar's Test P-Value : 0.09609
##
## Sensitivity : 0.8990
## Specificity : 0.9870
## Pos Pred Value : 0.9674
## Neg Pred Value : 0.9580
## Prevalence : 0.3000
## Detection Rate : 0.2697
## Detection Prevalence : 0.2788
## Balanced Accuracy : 0.9430
##
## 'Positive' Class : bad
##
This illustrates the need to adequately balance TP and FP rates. We need to have a way to do a cost-benefit analysis, and the solution will often depend on the question/problem.
caTools::colAUC(p, diamonds[["price"]], plotROC = TRUE)
Type of Clustering
library(factoextra)
Use Petal.Length and Sepal.Length in the iris dataset as the example.
library(datasets)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
To assess the clustering tendency, the Hopkins statistic and a visual approach can be used. This can be performed using the function get_clust_tendency() in the \(factoextra\) package, which creates an ordered dissimilarity image (ODI).
iris[, -5] %>% # Remove column 5 (Species)
scale() %>% # Scale variables
get_clust_tendency(n = 50, gradient = list(low = "steelblue", high = "white"))
## $hopkins_stat
## [1] 0.2002686
##
## $plot
A distance matrix is necessary for hierarchical clustering. In bioinformatics, distance matrices (also called heat map) are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space. They are used in structural and sequential alignment, and for the determination of protein structures from NMR or X-ray crystallography.
res.dist <- get_dist(iris[,1:4], stand = TRUE, method = "euclidean")
fviz_dist(res.dist, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# or use heatmap() from `stats` package
heatmap(as.matrix.data.frame(iris[,1:4]), scale = "column")
In the above analysis we (correctly) assumed that there were 3 clusters. In practice this is usually unknown, but we can use cluster validation to determine the number of clusters with the \(NbClust\) package. What NbClust() actually does is use different clustering methods to obtain a clustering, and then evaluate the quality based on different quality measures (here called indices). Try another clustering method by changing to method = "single". Results are very different!
- distance: The similarity notion is a key concept for clustering, in the way to decide which clusters should be combined or divided when observing sets. An appropriate metric is strategic in order to achieve the best clustering, because it directly influences the shape of clusters. The Dissimilarity Matrix (or Distance matrix) is used in many algorithms of Density-based and Hierarchical clustering. (see reference)
- min.nc: minimal number of clusters, between 1 and (number of objects - 1)
- max.nc: maximal number of clusters, between 2 and (number of objects - 1), greater than or equal to min.nc. By default, max.nc = 15.
- method: the cluster analysis method to be used. This should be one of: “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”, “centroid”, “kmeans”.
- index: the index to be calculated. This should be one of: “kl”, “ch”, “hartigan”, “ccc”, “scott”, “marriot”, “trcovw”, “tracew”, “friedman”, “rubin”, “cindex”, “db”, “silhouette”, “duda”, “pseudot2”, “beale”, “ratkowsky”, “ball”, “ptbiserial”, “gap”, “frey”, “mcclain”, “gamma”, “gplus”, “tau”, “dunn”, “hubert”, “sdindex”, “dindex”, “sdbw”, “all” (all indices except GAP, Gamma, Gplus and Tau), “alllong” (all indices with Gap, Gamma, Gplus and Tau included).

library(NbClust)
# all criteria is used to determine the optimal number of cluster
valClust <- NbClust(data=iris[,1:4], distance="euclidean", method = "ward.D", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, bio-medicine and geo-spatial analysis. There are different types of clustering methods (see Reference):
In the \(factoextra\) package, the eclust() function makes it easier to visualize clusters using the first two principal components. It also allows us to call multiple clustering methods using one function (similar to train in \(caret\)). Check the reference in Related Books: Practical Guide To Cluster Analysis in R.
- k: the number of clusters to be generated. If NULL, the gap statistic is used to estimate the appropriate number of clusters. In the case of kmeans, k can be either the number of clusters, or a set of initial (distinct) cluster centers.
- hc_method: the agglomeration method to be used: ward.D, ward.D2, single, complete, average, etc.

Not sure what FUNcluster = c("agnes", "diana") does; need to check in the future.
Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst. The options for the clustering function FUNcluster for partitioning methods are:

- kmeans: K-means clustering (MacQueen 1967), in which each cluster is represented by the center or mean of the data points belonging to the cluster. The K-means method is sensitive to anomalous data points and outliers.
- pam: K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers compared to k-means.
- clara: CLARA algorithm (Clustering Large Applications), which is an extension to PAM adapted for large data sets.

K-means calculation:
1. Choose the number of clusters (K) and obtain the data points
2. Place the centroids $c_1$, $c_2$, ..... $c_k$ randomly
3. Repeat this step until convergence or until the end of a fixed number of iterations
a) for each data point x_i:
- find the nearest centroid($c_1$, $c_2$ .. $c_k$)
- assign the point to that cluster
b) for each cluster j = 1..k
- new centroid = mean of all points assigned to that cluster
4. End
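The iteration above can be sketched compactly in base R (this sketch assumes no cluster becomes empty during the updates; in practice use kmeans() from the `stats` package):

```r
km_sketch <- function(x, k, iters = 20) {
  centroids <- x[sample(nrow(x), k), , drop = FALSE]             # step 2
  for (i in seq_len(iters)) {                                    # step 3
    # distances from each centroid (rows 1..k) to each data point
    d <- as.matrix(dist(rbind(centroids, x)))[1:k, -(1:k)]
    cluster <- apply(d, 2, which.min)                            # 3a: assign
    centroids <- apply(x, 2, function(col) tapply(col, cluster, mean))  # 3b
  }
  cluster
}
set.seed(1)
table(km_sketch(as.matrix(iris[, 1:4]), k = 3))  # cluster sizes
```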
# Visualizing partitional clustering
clusters.km <- eclust(iris[,1:4], FUNcluster = c("kmeans"), k = 3, nstart = 25, graph = FALSE)
fviz_cluster(clusters.km,geom="point")
Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in the dataset. It does not require to pre-specify the number of clusters to be generated.
The options for the clustering function FUNcluster for hierarchical clustering are:

- single: the criterion is that the distance between two clusters is set equal to the minimum of the distances between all instances. It results in very elongated clusters, so it is good for flexible cluster shapes.
- complete: the criterion is that the distance between two clusters is set equal to the maximum of all distances between instances in the clusters. It results in tightly bound, compact clusters.

Check: Comparing Cluster Dendrograms in R
1. Find the shortest distance of each instance to the others, then order them in ascending order
2. Cluster the instances one by one according to that order
# Visualizing hierarchical clustering (dendrogram)
clusters.si <- eclust(iris[,1:4], FUNcluster = c("hclust"), k = 3, hc_method="single", graph = FALSE)
fviz_cluster(clusters.si, geom="point")
fviz_dend(clusters.si,show_labels = FALSE, rect=TRUE)
1. find the closest pairwise instances and cluster the two instances
2. compare the distance between an unassigned instance and the farthest instance of each cluster or any other unassigned instance. Merge the unassigned instance into the cluster with the shortest distance, until all the instances are clustered together.
# Visualizing hierarchical clustering (dendrogram)
clusters.cl <- eclust(iris[,1:4], FUNcluster = c("hclust"), k = 3, hc_method="complete", graph = FALSE)
fviz_cluster(clusters.cl, geom="point")
fviz_dend(clusters.cl,show_labels = FALSE, rect=TRUE)
Fuzzy clustering is also known as a soft method. Standard clustering approaches (K-means, PAM) produce partitions, in which each observation belongs to only one cluster; this is known as hard clustering. In fuzzy clustering, items can be members of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of belonging to a given cluster. The fuzzy c-means method is the most popular fuzzy clustering algorithm. See Reference
# Visualizing fuzzy clustering
clusters.fa <- eclust(iris[,1:4], FUNcluster = c("fanny"), k = 3, graph = FALSE)
fviz_cluster(clusters.fa, geom="point")
A cluster is a dense area of instances, where each dense area is separated by low-density areas. DBSCAN is a partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. The basic idea behind the density-based clustering approach is derived from a human intuitive clustering method. See Reference
1. For each point xi, compute the distance between xi and the other points. Finds all neighbor points within distance eps of the starting point (xi). Each point, with a neighbor count greater than or equal to MinPts, is marked as core point or visited.
2. For each core point, if it's not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
3. Iterate through the remaining unvisited points in the data set.
4. Those points that do not belong to any cluster are treated as outliers or noise.
# Load the data
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Compute DBSCAN using fpc package
library(fpc)
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
ellipse = FALSE, show.clust.cent = FALSE,
geom = "point",palette = "jco", ggtheme = theme_classic())
In section 2.1.2 Optimal Number of Clusters, we used a vote of existing criteria with NbClust() in the \(NbClust\) package to determine the number of clusters. However, the different criteria have their strengths and weaknesses. Depending on the dataset, knowing the nature of the criteria helps in understanding precisely which validation method to use.
| criteria | Validation Method, index |
|---|---|
| Compactness | Hart, kl, hartigan, Clest, Sil, Gap |
| Disconnectivity | Lee |
| Compactness (elbow) | hartigan (Hartigan 1975), kl (Krzanowski and Lai 1988) |
| Compactness and separation | ch (Calinski and Harabasz 1974), Clest, Sil |
| Reference distribution or resampling | Gap, Clest, BH |
To evaluate the goodness of the clustering results, plot the silhouette coefficient as follows:
fviz_silhouette(clusters.fa, palette = "jco",
ggtheme = theme_minimal())
## cluster size ave.sil.width
## 1 1 50 0.79
## 2 2 45 0.38
## 3 3 55 0.43
Association Rule Discovery aims to discover interesting correlations or other relationships between attributes in large databases, in expressions of the form \(X \Rightarrow Y\), such as:
In practice, association rules are useful for analyzing and predicting customer behavior. They play an important part in customer analytics, market basket analysis, product clustering, catalog design and store layout.
Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships.
1. Find all item sets that meet minimum coverage
2. Find all rules that meet minimum accuracy
3. * Find all rules that meet minimum lift
4. Prune by:
* retaining the more precise rule with smaller/same support, or higher/same accuracy, or higher/same confidence
* retaining the simpler rule with higher accuracy or higher confidence.
* Step 3 could be ignored if only comparing rules with common rhs, Y, because \(Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{P(Y)} \sim Confidence(X \Rightarrow Y)\)
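Support, confidence and lift can be computed by hand from transaction counts; the counts below are hypothetical, purely to illustrate the criteria above:

```r
n    <- 435  # total transactions (hypothetical)
n_x  <- 168  # transactions containing X
n_y  <- 177  # transactions containing Y
n_xy <- 163  # transactions containing both X and Y

support    <- n_xy / n          # coverage of the rule X => Y
confidence <- n_xy / n_x        # accuracy of the rule
lift       <- confidence / (n_y / n)
round(c(support = support, confidence = confidence, lift = lift), 3)
```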
library(arules)
library(arulesViz)
vote <- read.csv("vote.csv")
head(as.tibble(vote))
## # A tibble: 6 x 17
## handicapped_inf~ water_project_c~ adoption_of_the~ physician_fee_f~
## <fct> <fct> <fct> <fct>
## 1 n y n y
## 2 n y n y
## 3 ? y y ?
## 4 n y y n
## 5 y y y n
## 6 n y y n
## # ... with 13 more variables: el_salvador_aid <fct>,
## # religious_groups_in_schools <fct>, anti_satellite_test_ban <fct>,
## # aid_to_nicaraguan_contras <fct>, mx_missile <fct>, immigration <fct>,
## # synfuels_corporation_cutback <fct>, education_spending <fct>,
## # superfund_right_to_sue <fct>, crime <fct>, duty_free_exports <fct>,
## # export_administration_act_south_africa <fct>, Class <fct>
The Apriori algorithm, apriori(), was developed in the early 1990s and is the most commonly used approach.
apriori() takes parameter = list(target = ...), where target is one of:

- target = "frequent itemsets" (default)
- target = "maximally frequent itemsets"
- target = "closed frequent itemsets"
- target = "rules" (only available for Apriori; use ruleInduction for eclat.)
- target = "hyperedgesets" (only available for Apriori; see references for the definition of association hyperedgesets)

plotly_arules(): plot an interactive scatter plot for association rules using plotly.

findRules <- function(rules, c = 0.9, s = 0.35, l = 1.8){
  good_rules = rules[quality(rules)$confidence > c]
  good_rules = good_rules[quality(good_rules)$support > s]
  good_rules = good_rules[quality(good_rules)$lift > l]
  good_rules  # return the filtered rules explicitly
}
rules <- apriori(vote)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 43
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.10s].
## writing ... [482817 rule(s)] done [0.14s].
## creating S4 object ... done [0.23s].
good_rules <- findRules(rules) # Customize function
plotly_arules(good_rules, c = 0.9, s = 0.35, l = 1.8)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
inspect(head(good_rules, n=10, by = "lift"))
## lhs rhs support confidence lift count
## [1] {el_salvador_aid=y,
## Class=republican} => {physician_fee_freeze=y} 0.3586207 0.9936306 2.441973 156
## [2] {crime=y,
## Class=republican} => {physician_fee_freeze=y} 0.3563218 0.9810127 2.410963 155
## [3] {physician_fee_freeze=y,
## el_salvador_aid=y} => {Class=republican} 0.3586207 0.9285714 2.404337 156
## [4] {physician_fee_freeze=y,
## crime=y} => {Class=republican} 0.3563218 0.9226190 2.388924 155
## [5] {Class=republican} => {physician_fee_freeze=y} 0.3747126 0.9702381 2.384483 163
## [6] {physician_fee_freeze=y} => {Class=republican} 0.3747126 0.9209040 2.384483 163
## [7] {physician_fee_freeze=y,
## mx_missile=n} => {el_salvador_aid=y} 0.3517241 0.9935065 2.038563 153
## [8] {aid_to_nicaraguan_contras=n,
## crime=y} => {el_salvador_aid=y} 0.3770115 0.9879518 2.027165 164
## [9] {religious_groups_in_schools=y,
## aid_to_nicaraguan_contras=n,
## crime=y} => {el_salvador_aid=y} 0.3540230 0.9871795 2.025581 154
## [10] {anti_satellite_test_ban=y,
## aid_to_nicaraguan_contras=y,
## education_spending=n,
## Class=democrat} => {el_salvador_aid=n} 0.3586207 0.9570552 2.001534 156
Or we can subset the rule output from Apriori by defining the rhs or lhs.
# subset for rule with specification on lhs
def_rules <- subset(rules, subset = lhs %in% "Class=democrat")
# subset for rule with specification on rhs
Dem_rules <- subset(rules, subset = rhs %in% "Class=democrat")
good_Dem_rules <- findRules(Dem_rules, c = 0.9, s = 0.43, l = 1.5)
plotly_arules(good_Dem_rules)
The apriori() function can return either all itemsets, frequent itemsets, closed itemsets, association rules, or hyperedges. By default it returns rules based on frequent itemsets, and unfortunately there doesn’t seem to be a way to generate rules based on closed itemsets directly. Instead, you can start by generating the closed itemsets using the target parameter. Once you have those, you can generate the corresponding rules using the ruleInduction() function. Note that this function needs your dataset reformatted as transactional data as a parameter. From there, you can work with rules just the same as before, but the non-closed sets have already been filtered out (about half for the vote dataset).
itemsets = apriori(vote, parameter=list(target="closed frequent itemsets"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 closed frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 43
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## filtering closed item sets ... done [0.01s].
## writing ... [46577 set(s)] done [0.01s].
## creating S4 object ... done [0.02s].
vote_trans = as(vote, "transactions")
rules = ruleInduction(itemsets, vote_trans, .8, control=list(method="apriori"))
rules
## set of 246588 rules
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression.
| Category | Maximum Margin Classifier | Support Vector Classifier | Support Vector Machine |
|---|---|---|---|
| multi-dimensional | YES | YES | YES |
| allowance for misclassification | NO | YES | YES |
| non-linear boundaries | NO | NO | YES |
Check reference for the classification of SVM.
Maximum Margin Classifier is the simplest way to divide your data when it is linearly separable (by a hyperplane in multi-dimensional space). But for obvious reasons it cannot be applied to data where no clear linear boundary is possible.
Support Vector Classifier is an extension of the maximum margin classifier in which we allow some misclassification to happen.
Support Vector Machine, or SVM, is a further extension of the SVC that accommodates non-linear boundaries; it is therefore also called a non-linear classifier, in contrast to the linear classifiers (support vector classifier and Maximum Margin Classifier). Though there is a clear distinction among these definitions, people prefer to call all of them SVMs to avoid complications.
Note: a linear SVM is a special case of a neural network: one with no hidden layers and the hinge loss as its loss function.
Suppose we are given a training dataset of \(n\) points of the form \((\vec{x}_1,y_1), ..., (\vec{x}_n,y_n)\), where the \(y_i\) are either 1 or -1, each indicating the class to which the point \(\vec{x_i}\) belongs.
In the discussion of SVMs, we need to clarify the concept of a linear function to avoid confusion. In mathematics, the term linear function refers to two distinct but related notions:
Check Reference for definition of linear function.
Check Reference for the definition of projection in linear algebra.
Check Reference for Kernel method.
Check Reference for orthogonalization.
We want to find the maximum-margin hyperplane that divides the group of points \(\vec{x_i}\) for which \(y_i=1\) from the group of points for which \(y_{i}=-1\); that is, the hyperplane such that the distance between it and the nearest point \(\vec{x_i}\) from either group is maximized. In the case of support-vector machines, a data point is viewed as a \(p\)-dimensional vector (a list of \(p\) numbers), and we want to know whether we can separate such points with a \((p-1)\)-dimensional hyperplane.
Any such hyperplane can be written as the set of points \(\vec{x}\) satisfying \[\vec{\omega} \cdot \vec{x} - b = 0\] where \(\vec{\omega}\) is the normal vector to the hyperplane. During the model learning phase, the SVM maximizes the width of the gap between the classes, bounded by two parallel hyperplanes. The region bounded by these two hyperplanes is called the margin, and the hyperplane that lies halfway between them is called the maximum-margin hyperplane. It is completely determined by those \(\vec{x}_i\) that lie nearest to it; these \(\vec{x_i}\) are called support vectors.
| Characteristic | linearly separable | linearly inseparable |
|---|---|---|
| Normalized loss function | \(\ell(y) \in \{0,\ 1\}\) | \(\ell(y) = max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}\) |
| Type of margin | hard margin | soft margin |
| Allowance of misclassification | data must lie on the correct side of the margin | data can lie on the wrong side of the margin, with a hinge loss that increases linearly with the one-sided error |
| Optimization problem | \(\begin{eqnarray}Minimize\ \|\vec{\omega}\|\\s.t.\ y_i(\vec{\omega} \cdot \vec{x_i} - b) \ge 1\ with\ 1 \le i \le n\end{eqnarray}\) | \(\begin{eqnarray}Minimize\ [\frac{1}{n}\sum_{i=1}^n max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}] +\lambda \|\vec{\omega}\|^2\\s.t.\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \in \Re,\ for\ all\ 1 \le i \le n\end{eqnarray}\) |
Maximum-margin hyperplane (left); Hinge loss vs zero-loss (right)
If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data such that the distance between them is maximized.
| Step | Mathematical expression |
|---|---|
| 1. The loss always yields 0 because no point can be misclassified: | \(\ell(y_i;\vec{x_i},\vec{\omega},b) = \left\{\begin{array}{l}1,\ if\ y_i (\vec{\omega} \cdot \vec{x_i}-b)<0\\0,\ if\ y_i (\vec{\omega} \cdot \vec{x_i}-b) \ge 0\end{array}\right.=0,\ for\ y_i (\vec{\omega} \cdot \vec{x_i}-b) \ge 0\) |
| 2. With a normalized dataset, points labeled \(y_i = 1\) lie on or above one hyperplane at normalized distance 1, and points labeled \(y_i = -1\) lie on or below the other at normalized distance \(-1\); because no point may fall in between, this is called a hard margin. The two hyperplanes are described by: | \(\vec{\omega} \cdot \vec{x} - b = \pm Hard\ Margin = \pm 1\) |
| 3. Since the data is separable, every vector \(\vec{x_i}\) must lie on the correct side of its hyperplane: | \(Constraint:\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1-0,\ for\ all\ 1 \le i \le n\) |
| 4. Geometrically, since the normalized margin is 1, the distance between the two hyperplanes is twice the distance from the mid-hyperplane to either of them (distance from a point to a plane): | \(\begin{array}{} Maximize\ d= 2 \times Margin = \frac{2}{\|\vec{\omega}\|} \\\Leftrightarrow Minimize\ \|\vec{\omega}\|\end{array}\) |
| 5. We solve the optimization problem of maximizing the distance between the hyperplanes subject to the hard margin constraint: | \(\begin{array}{} Minimize\ \|\vec{\omega}\|\\s.t.\ y_i(\vec{\omega} \cdot \vec{x_i} - b) \ge 1-0\ with\ 1 \le i \le n\end{array}\) |
| 6. Solving this problem for \(\vec{\omega}\) and \(b\) determines the classifier via the sign function, sgn: | \(\vec{x} \mapsto sgn(\vec{\omega} \cdot \vec{x} -b)\) |
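The hard-margin construction above can be sketched numerically. The following is a minimal sketch, assuming the \(e1071\) package loaded at the start of this note and toy data generated for illustration; a very large cost approximates the hard margin, and \(\vec{\omega}\) and \(b\) are recovered from the fitted support vectors:

```r
library(e1071)
set.seed(1234)
# Two well-separated clusters: linearly separable (with high probability)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))
# A very large cost leaves essentially no tolerance for margin violations,
# approximating the hard-margin classifier
fit <- svm(x, y, kernel = "linear", cost = 1e5, scale = FALSE)
# Recover omega = sum_i c_i y_i x_i and b; e1071 stores the c_i y_i in
# fit$coefs and b in fit$rho
omega <- drop(t(fit$coefs) %*% fit$SV)
b <- fit$rho
# sgn(omega . x - b) reproduces the fitted classifier's decisions
decision <- sign(drop(x %*% omega) - b)
```

Only the few points nearest the boundary appear in `fit$SV`; they alone determine the hyperplane.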
If the training data is linearly inseparable, we can still select two parallel hyperplanes that separate the two classes of data with some misclassification allowed, while the distance between the hyperplanes is maximized.
| Step | Mathematical expression |
|---|---|
| 1. For data on the wrong side of the margin, the loss is proportional to the distance from the margin, so the loss is defined by the hinge loss function: | \(\ell(y_i;\vec{x_i},\vec{\omega},b) = max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}\) |
| 2. With a normalized or standardized dataset, to allow the two parallel hyperplanes to misclassify some outliers, a soft margin penalizes any prediction that falls on the wrong side of its hyperplane. These hyperplanes are described by: | \(\vec{\omega} \cdot \vec{x} - b = \pm Soft\ Margin \simeq \pm 1\) |
| 3. Since the data is inseparable, the margin quantity \(y_i (\vec{\omega} \cdot \vec{x_i} - b)\) can be any real number, and the constraint is relaxed by the hinge loss: | \(Constraint:\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1 -\ell(y_i;\vec{x_i},\vec{\omega},b),\ for\ all\ 1 \le i \le n\) |
| 4. Since the margin width \(2d\) is inversely proportional to \(\|\vec{\omega}\|\), we need to balance maximizing the margin between the classes against minimizing the amount of misclassification. We therefore introduce a regularization parameter \(\lambda\) that sets the degree of importance given to misclassifications. The objective becomes: | \(\begin{array}{} Maximize\ d \propto \frac{1}{\|\vec{\omega}\|^{*}} \Leftrightarrow\\ Minimize\ \|\vec{\omega}\|^*=[\frac{1}{n}\sum_{i=1}^n \ell(y_i;\vec{x_i},\vec{\omega},b)] +\lambda \|\vec{\omega}\|^2 \end{array}\) |
| 5. We maximize the distance between the hyperplanes by minimizing the regularized loss: | \(\begin{array}{}Minimize\ \|\vec{\omega}\|^*=[\frac{1}{n}\sum_{i=1}^n \ell(y_i;\vec{x_i},\vec{\omega},b)] +\lambda \|\vec{\omega}\|^2\\s.t.\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1- \ell(y_i;\vec{x_i},\vec{\omega},b),\ for\ all\ 1 \le i \le n\end{array}\) |
| 6. For a given \(\lambda\), the resulting \(\vec{\omega}\) and \(b\) determine the classifier via the sign function, sgn: | \(\vec{x} \mapsto sgn_\lambda(\vec{\omega} \cdot \vec{x} -b)\) |
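The regularization trade-off can be seen through the cost parameter of \(e1071\)'s svm, which is inversely related to \(\lambda\): a small cost means strong regularization and a wide soft margin. A minimal sketch on deliberately overlapping toy data (the cluster means are arbitrary choices for illustration):

```r
library(e1071)
set.seed(1234)
# Overlapping clusters: no separating hyperplane exists, so some
# misclassification must be tolerated (soft margin)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 1.5), ncol = 2))
y <- factor(rep(c(-1, 1), each = 30))
# Small cost: wide margin, many violations tolerated, many support vectors
fit_soft <- svm(x, y, kernel = "linear", cost = 0.01, scale = FALSE)
# Large cost: narrow margin, violations penalized heavily, fewer support vectors
fit_hard <- svm(x, y, kernel = "linear", cost = 100, scale = FALSE)
c(soft = fit_soft$tot.nSV, hard = fit_hard$tot.nSV)
```

Points strictly outside the margin contribute zero hinge loss and drop out of the solution, which is why the wide-margin fit retains many more support vectors.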
Check reference for the regularization parameter, \(\lambda\).
The primal problem can be solved via the Lagrangian function, whereas the dual problem can be solved via the Lagrangian dual function. The Lagrangian dual problem is obtained by forming the Lagrangian of a minimization problem, using nonnegative Lagrange multipliers to add the constraints to the objective function, and then solving for the primal variable values that minimize the original objective function. This solution gives the primal variables as functions of the Lagrange multipliers, which are called dual variables, so that the new problem is to maximize the objective function with respect to the dual variables under the derived constraints on the dual variables (including at least the nonnegativity constraints). The primal and dual problems have the following relation:

- Weak duality: dual optimal \(\le\) primal optimal (for a minimization primal).
- Strong duality: primal optimal \(=\) dual optimal.

The SVM problem is convex and satisfies the Strong Duality Theorem. The primal problem of the soft-margin SVM is difficult to solve directly because the hinge loss \(\ell\) is not differentiable in \(\vec{\omega}\). Therefore, we rewrite it as the Lagrangian dual problem.
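As a standard bridging step (not spelled out in the note), the Lagrangian of the hard-margin primal attaches a nonnegative multiplier \(c_i\) to each constraint; setting the gradients to zero yields the substitution used in step 1 below:

\[\begin{aligned} L(\vec{\omega}, b, \vec{c}) &= \tfrac{1}{2}\|\vec{\omega}\|^2 - \sum_{i=1}^n c_i\left[y_i(\vec{\omega}\cdot\vec{x_i}-b)-1\right], \quad c_i \ge 0 \\ \frac{\partial L}{\partial \vec{\omega}} = 0 &\Rightarrow \vec{\omega} = \sum_{i=1}^n c_i y_i \vec{x_i}, \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^n c_i y_i = 0 \end{aligned}\]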
| Step | Mathematical expression |
|---|---|
| 1. To represent \(\vec{\omega}\) explicitly through coefficients \(c_i\) for all \(i\), substitute \(\vec{\omega}= \sum_{i=1}^n c_iy_i\vec{x_i}\) into the Lagrangian function to obtain the Lagrangian dual function. The resulting optimization problem in the unknown vector \(\vec{c}\) can be solved with quadratic programming. | \(\begin{array}{} maximize\ f(c_1...c_n) = \sum_{i=1}^n c_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_ic_i(x_i \cdot x_j)y_jc_j \\ s.t. \sum_{i=1}^n c_i y_i=0,\ and\ 0 \le c_i \le \frac{1}{2n\lambda}\ for\ all\ i \end{array}\) |
| 2. Plug \(\vec{c}\) in and solve for: | \(\vec{\omega}= \sum_{i=1}^n c_iy_i\vec{x_i}\) |
| 3. Finally, new points can be classified by computing | \(\vec{x} \mapsto sgn_\lambda(\vec{\omega} \cdot \vec{x} -b)\) |
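One step the table leaves implicit is recovering the offset \(b\). For any support vector \(\vec{x_i}\) lying exactly on the margin boundary, the constraint holds with equality, and since \(y_i \in \{-1, 1\}\) implies \(1/y_i = y_i\):

\[y_i(\vec{\omega}\cdot\vec{x_i} - b) = 1 \;\Rightarrow\; b = \vec{\omega}\cdot\vec{x_i} - y_i\]

In practice, averaging this expression over all support vectors gives a numerically more stable estimate of \(b\).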
Check reference for duality. Check reference for quadratic programming.
Applying the kernel trick to maximum-margin hyperplanes creates nonlinear classifiers. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.
The following are some common kernels, where \(r\), \(d\), \(\gamma\) are kernel parameters and \(\vec{x_i}\), \(\vec{x_j}\) are arbitrary data points from the input dataset:

| Kernel | \(k(\vec{x_i},\vec{x_j})\) |
|---|---|
| Linear | \(\vec{x_i} \cdot \vec{x_j}\) |
| Polynomial | \((\gamma\, \vec{x_i} \cdot \vec{x_j} + r)^d\) |
| Radial basis function (RBF) | \(exp(-\gamma \|\vec{x_i}-\vec{x_j}\|^2)\) |
| Sigmoid | \(tanh(\gamma\, \vec{x_i} \cdot \vec{x_j} + r)\) |
| Step | Mathematical expression |
|---|---|
| 1. Everything is the same as solving the Lagrangian dual function, but using the transformed data vector \(\varphi(\vec{x})\) instead of the original \(\vec{x}\): | \(\begin{array}{}\begin{aligned}{} maximize\ f(c_1...c_n) & = \sum_{i=1}^n c_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i c_i(\varphi(\vec{x_i}) \cdot \varphi(\vec{x_j}))y_jc_j \\ & = \sum_{i=1}^n c_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i k(\vec{x_i}, \vec{x_j}) y_j c_j \\ \end{aligned} \\ s.t.\ \sum_{i=1}^n c_i y_i=0,\ and\ 0 \le c_i \le \frac{1}{2n\lambda}\ for\ all\ i \end{array}\) |
| 2. Plug \(\vec{c}\) in and solve for classification vector \(\vec{w}\) in the transformed space: | \(\vec{\omega}=\sum_{i=1}^n c_i y_i \varphi(\vec{x_i})\) |
| 3. Finally, new points can be classified by computing | \(\vec{z} \mapsto sgn_\lambda(\vec{\omega} \cdot \varphi(\vec{z}) -b) =sgn([\sum_{i=1}^n c_i y_i k(\vec{x_i},\vec{z})]-b)\) |
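A minimal sketch of the kernelized classifier, assuming the \(e1071\) package and a toy disc-in-ring dataset (radii chosen arbitrarily): no hyperplane separates the classes in the original space, but the RBF kernel does so in its implicit feature space.

```r
library(e1071)
set.seed(1234)
# Inner disc (class 1) surrounded by an outer ring (class -1):
# not linearly separable in the original two dimensions
n <- 100
radius <- c(runif(n / 2, 0, 1), runif(n / 2, 2, 3))
theta <- runif(n, 0, 2 * pi)
x <- cbind(radius * cos(theta), radius * sin(theta))
y <- factor(rep(c(1, -1), each = n / 2))
# Replace the dot product with the RBF kernel k(u, v) = exp(-gamma ||u - v||^2)
fit_linear <- svm(x, y, kernel = "linear")
fit_rbf    <- svm(x, y, kernel = "radial")
# Compare training accuracy of the linear and kernelized classifiers
c(linear = mean(predict(fit_linear, x) == y),
  rbf    = mean(predict(fit_rbf, x) == y))
```

The RBF fit recovers the circular boundary (near-perfect training accuracy), while the best linear hyperplane cannot do much better than chance on this data.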
Check Reference for kernel method
A kernel function calculates the dot product of two data points in a higher-dimensional feature space. The following derivation explains why the kernel function alone is enough to recover the geometric properties (dot product, distance, and angle) of two arbitrary data points in some higher-dimensional feature space, without ever knowing the transformation function \(\varphi()\):
A valid kernel is a finitely positive semi-definite function: for any finite set of points, the corresponding kernel (Gram) matrix \(M_k\) is positive semi-definite, i.e. \(\vec{z}^T M_k \vec{z} \ge 0\) for every vector \(\vec{z}\):\[M_k = \left[\begin{array}{rrr} k(\vec{x}_1, \vec{x}_1) & ...& k(\vec{x}_1, \vec{x}_n)\\ ...&...&...\\ k(\vec{x}_n, \vec{x}_1) & ...& k(\vec{x}_n, \vec{x}_n)\\ \end{array}\right] \succeq 0\]
dot product of two vector points in mapped feature space via \(\varphi()\):\[\begin{aligned} \langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle & = \langle \vec{z}_i,\vec{z}_j \rangle \\ & = \left[\begin{array}{rrr}z_{i1} & z_{i2} & z_{i3} \end{array}\right] \cdot \left[\begin{array}{r}z_{j1} \\z_{j2} \\ z_{j3}\end{array}\right] \\ & = k(\vec{x_i},\vec{x_j})\\ \end{aligned}\]
distance of two vector points in mapped feature space via \(\varphi()\):\[\begin{aligned} \|\varphi(\vec{x_i})-\varphi(\vec{x_j})\|^2 & = (\varphi(\vec{x_i})-\varphi(\vec{x_j}))^T(\varphi(\vec{x_i})-\varphi(\vec{x_j})) \\ & =(\varphi(\vec{x_i})^T-\varphi(\vec{x_j})^T)(\varphi(\vec{x_i})-\varphi(\vec{x_j})) \\ & = \varphi(\vec{x_i})^T\varphi(\vec{x_i})-2\varphi(\vec{x_i})^T\varphi(\vec{x_j})+\varphi(\vec{x_j})^T\varphi(\vec{x_j}) \\ & = \langle \varphi(\vec{x_i}),\varphi(\vec{x_i}) \rangle -2\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle+\langle \varphi(\vec{x_j}),\varphi(\vec{x_j}) \rangle \\ & = k(\vec{x_i},\vec{x_i})-2k(\vec{x_i},\vec{x_j})+k(\vec{x_j},\vec{x_j}) \\ \end{aligned}\]
angle of two vector points in mapped feature space via \(\varphi()\):\[\begin{aligned} \langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle & = \|\varphi(\vec{x_i})\| \cdot \|\varphi(\vec{x_j})\| cos\theta \\ \Rightarrow cos\theta & = \frac{\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle}{\|\varphi(\vec{x_i})\| \cdot \|\varphi(\vec{x_j})\|}\\ & =\frac{\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle}{\sqrt{\langle \varphi(\vec{x_i}),\varphi(\vec{x_i}) \rangle }\sqrt{\langle \varphi(\vec{x_j}),\varphi(\vec{x_j}) \rangle }} \\ & = \frac{k(\vec{x_i},\vec{x_j})}{\sqrt{k(\vec{x_i},\vec{x_i})}\sqrt{k(\vec{x_j},\vec{x_j})}}\\ \end{aligned}\]
where \(\vec{z}_i = \varphi(\vec{x_i})\) denotes the image of \(\vec{x_i}\) in the feature space.
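The positive semi-definiteness property can be checked numerically. A small sketch, assuming an RBF kernel with \(\gamma = 1\) on arbitrary random points:

```r
set.seed(1234)
# Ten random points in R^2
X <- matrix(rnorm(20), ncol = 2)
# RBF kernel: k(u, v) = exp(-gamma * ||u - v||^2)
rbf <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
# Build the kernel (Gram) matrix M_k over all pairs of points
n <- nrow(X)
K <- outer(seq_len(n), seq_len(n),
           Vectorize(function(i, j) rbf(X[i, ], X[j, ])))
# Positive semi-definite: all eigenvalues >= 0 (up to numerical tolerance)
min(eigen(K, symmetric = TRUE)$values)
```

The smallest eigenvalue is nonnegative (up to floating-point tolerance), as the PSD condition requires.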
The kernel trick uses kernel functions to operate on raw feature vectors in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, simply by computing the inner products between the images of all pairs of data points in the feature space.
\[\vec{\omega} \cdot \vec{x} = \langle \sum_{i=1}^n c_iy_i\vec{x_i}, \vec{x} \rangle = \sum_{i=1}^n c_i y_i \langle \vec{x_i}, \vec{x} \rangle\]
\[\vec{\omega} \cdot \varphi(\vec{x}) = \langle \sum_{i=1}^n c_i y_i \varphi(\vec{x_i}), \varphi(\vec{x})\rangle = \sum_{i=1}^n c_i y_i \langle \varphi(\vec{x_i}), \varphi(\vec{x}) \rangle =\sum_{i=1}^n c_i y_i k(\vec{x_i},\vec{x})\]
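This identity can be verified for the degree-2 homogeneous polynomial kernel \(k(\vec{u},\vec{v}) = (\vec{u}\cdot\vec{v})^2\), whose explicit feature map in two dimensions is \(\varphi(\vec{v}) = (v_1^2,\ \sqrt{2}v_1v_2,\ v_2^2)\) (a standard textbook example, not from this note):

```r
# Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2
phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)
# The kernel computes the same dot product without ever forming phi
k <- function(u, v) sum(u * v)^2
x <- c(1, 2)
z <- c(3, 4)
k(x, z)               # (1*3 + 2*4)^2 = 121
sum(phi(x) * phi(z))  # 9 + 48 + 64 = 121 (up to floating point)
```

Both sides agree: the kernel evaluates the feature-space dot product in the original two dimensions, never constructing the three-dimensional \(\varphi\) vectors.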